feat(metrics): expose prefilter registry state via Prometheus #209

Merged
JustMaier merged 1 commit into main from ivy-prefilter-metrics
Apr 14, 2026
Conversation

@JustMaier
Contributor

Summary

Final Phase-1 prefilter follow-up: wires the design doc's Prometheus metrics (plus one extra age gauge) so Grafana can show substitution rate once `civitai_safe` is registered on v1.0.219.

The atomics were already tracked on `PrefilterEntry` (from PR #207); this is pure scrape-side plumbing. Collect-on-scrape fits the existing pattern in `handle_metrics`.

Metrics added

| Name | Type | Labels | Purpose |
| --- | --- | --- | --- |
| `bitdex_prefilter_registered` | gauge | index | number of registered entries |
| `bitdex_prefilter_cardinality` | gauge | index, name | current bitmap cardinality |
| `bitdex_prefilter_substitutions_total` | gauge-as-counter | index, name | queries that substituted this prefilter (cumulative) |
| `bitdex_prefilter_last_compute_seconds` | gauge | index, name | seconds spent on the last compute/refresh |
| `bitdex_prefilter_refresh_errors_total` | gauge-as-counter | index, name | cumulative refresh errors |
| `bitdex_prefilter_age_seconds` | gauge | index, name | seconds since last successful refresh (alert candidate: grows unbounded if nothing is refreshing) |

`substitutions_total` and `refresh_errors_total` are surfaced as gauges read from `AtomicU64` counters on the entry — same pattern as the existing `cache_hits_total` / `cache_misses_total` gauges.
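
As a rough illustration of the collect-on-scrape pattern described above, the sketch below reads per-entry atomics at scrape time and renders them in the Prometheus text exposition format. It is not the code from this PR: `EntryStats`, its field names, and `render_prefilter_metrics` are hypothetical stand-ins, and label values are assumed to need no escaping.

```rust
use std::fmt::Write;
use std::sync::atomic::{AtomicU64, Ordering};
use std::time::{Duration, Instant};

/// Hypothetical stand-in for the per-entry state; all field names are assumptions.
pub struct EntryStats {
    pub name: String,
    pub cardinality: AtomicU64,
    pub substitutions: AtomicU64,
    pub refresh_errors: AtomicU64,
    pub last_compute_nanos: AtomicU64,
    pub last_refreshed: Instant,
}

/// Emit one metric family: a TYPE line, then one sample per entry.
/// Assumes `index` and entry names contain no characters that need escaping.
fn family<F: Fn(&EntryStats) -> f64>(
    out: &mut String,
    metric: &str,
    index: &str,
    entries: &[EntryStats],
    read: F,
) {
    // Writing into a String cannot fail, so the results are ignored.
    let _ = writeln!(out, "# TYPE {metric} gauge");
    for e in entries {
        let _ = writeln!(out, "{metric}{{index=\"{index}\",name=\"{}\"}} {}", e.name, read(e));
    }
}

/// Read every atomic at scrape time and render the six metrics for one index.
pub fn render_prefilter_metrics(index: &str, entries: &[EntryStats], now: Instant) -> String {
    let mut out = String::new();
    let _ = writeln!(out, "# TYPE bitdex_prefilter_registered gauge");
    let _ = writeln!(out, "bitdex_prefilter_registered{{index=\"{index}\"}} {}", entries.len());

    family(&mut out, "bitdex_prefilter_cardinality", index, entries, |e| {
        e.cardinality.load(Ordering::Relaxed) as f64
    });
    // Cumulative counters are exposed as gauges: the value is simply the atomic's current reading.
    family(&mut out, "bitdex_prefilter_substitutions_total", index, entries, |e| {
        e.substitutions.load(Ordering::Relaxed) as f64
    });
    family(&mut out, "bitdex_prefilter_last_compute_seconds", index, entries, |e| {
        Duration::from_nanos(e.last_compute_nanos.load(Ordering::Relaxed)).as_secs_f64()
    });
    family(&mut out, "bitdex_prefilter_refresh_errors_total", index, entries, |e| {
        e.refresh_errors.load(Ordering::Relaxed) as f64
    });
    // Age keeps growing until the next successful refresh resets `last_refreshed`.
    family(&mut out, "bitdex_prefilter_age_seconds", index, entries, |e| {
        now.saturating_duration_since(e.last_refreshed).as_secs_f64()
    });
    out
}
```

Because the values are read only when a scrape arrives, the query path pays nothing beyond the atomic increments it already performs.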

Test plan

  • `cargo test --lib prefilter::` — 17/17 green
  • `cargo test --test prefilter_integration` — 3/3 green (ran 3× to check for flakes)
  • Deploy as v1.0.220 once v1.0.219 is stable. No schema or on-disk changes.
  • After deploy: register `civitai_safe`, verify `bitdex_prefilter_registered{index="civitai"} == 1` and `bitdex_prefilter_substitutions_total` rises under feed traffic.

Follow-ups remaining

🤖 Generated with Claude Code

Wires the 5 metrics from design-prefilter-registry.md plus one extra age
gauge into the collect-on-scrape path. Atomics were already tracked on
PrefilterEntry from PR #207; this PR adds scrape-side plumbing so Grafana
can show registered count, cardinality, substitution rate, last compute
duration, refresh errors, and age-since-refresh.

- bitdex_prefilter_registered{index} — gauge, count of registered entries
- bitdex_prefilter_cardinality{index,name} — gauge, current bitmap size
- bitdex_prefilter_substitutions_total{index,name} — counter surfaced as gauge
  (matches existing cache_hits_total pattern — the value comes from an
  AtomicU64 counter on the entry, not from a live Prometheus counter)
- bitdex_prefilter_last_compute_seconds{index,name} — gauge
- bitdex_prefilter_refresh_errors_total{index,name} — gauge
- bitdex_prefilter_age_seconds{index,name} — gauge, alert candidate: if an
  SWR thread or orchestrator isn't refreshing, age will grow unbounded

Tests: lib 17/17 + integration 3/3 still green. No behavior change to the
substitute() hot path.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
JustMaier added a commit that referenced this pull request Apr 14, 2026
Closes design Goal 2 ("Stale-while-revalidate — periodic refresh without
blocking queries") and the last Phase-1 gap called out by the post-merge
Plan Review.

A single dedicated thread per engine, spawned from the server's boot
phase 7 (after eager preload + bound cache). Each tick (default 10s):

  for entry in registry.entries():
      if entry.is_stale(now): refresh_prefilter(entry.name)

`refresh_prefilter` recomputes off the query path and atomically swaps
the bitmap via ArcSwap — in-flight queries holding the old Arc continue
reading; next query sees the new one.

Lifecycle safety:
- Holds `Weak<ConcurrentEngine>`, so engine drop breaks the cycle.
- Checks `self.shutdown` between entries (not just per-tick), so the
  thread can exit promptly even if it's mid-refresh on a stale entry.
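
A minimal sketch of the loop and the two lifecycle rules above, using only the standard library. The `Engine` and `Entry` types, the `stale` flag, and `spawn_swr` are hypothetical simplifications (the real thread works against `ConcurrentEngine` and the prefilter registry, and where the shutdown flag lives is an assumption here).

```rust
use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};
use std::sync::{Arc, Weak};
use std::thread::{self, JoinHandle};
use std::time::Duration;

/// Hypothetical registry entry; `stale` stands in for `is_stale(now)`.
pub struct Entry {
    pub name: &'static str,
    pub stale: AtomicBool,
}

/// Hypothetical engine; the real type is `ConcurrentEngine`.
pub struct Engine {
    pub shutdown: AtomicBool,
    pub entries: Vec<Entry>,
    pub refreshed: AtomicU64,
}

impl Engine {
    /// Stand-in for the real recompute-and-swap; here it only clears the flag.
    fn refresh_prefilter(&self, entry: &Entry) {
        entry.stale.store(false, Ordering::Release);
        self.refreshed.fetch_add(1, Ordering::Relaxed);
    }
}

/// Spawn the refresh thread. It holds only a Weak reference, so it never
/// keeps the engine alive on its own.
pub fn spawn_swr(engine: &Arc<Engine>, tick: Duration) -> JoinHandle<()> {
    let weak: Weak<Engine> = Arc::downgrade(engine);
    thread::spawn(move || loop {
        {
            // Upgrade once per tick; exit if the engine has been dropped.
            let Some(engine) = weak.upgrade() else { return; };
            for entry in &engine.entries {
                // Check shutdown between entries, not just once per tick.
                if engine.shutdown.load(Ordering::Acquire) {
                    return;
                }
                if entry.stale.load(Ordering::Acquire) {
                    engine.refresh_prefilter(entry);
                }
            }
            // The strong reference is dropped here, before sleeping.
        }
        thread::sleep(tick);
    })
}
```

Upgrading inside the loop and dropping the strong reference before the sleep means the thread only holds the engine alive for the duration of one pass, which is what lets an engine drop end the thread.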

Integration test `swr_thread_refreshes_stale_prefilters` verifies
end-to-end: register 10 docs → insert 5 more → rewind last_refreshed →
cardinality flips to 15 within 5s.

With the age gauge from PR #209, ops can alert on SWR thread liveness
(bitdex_prefilter_age_seconds > 2 × refresh_interval = something is
wedged).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@JustMaier JustMaier merged commit 9ad72d4 into main Apr 14, 2026
1 check failed